Classifying disease outbreak reports using n-grams and semantic features

Authors

  • Mike Conway
  • Son Doan
  • Ai Kawazoe
  • Nigel Collier
Abstract

INTRODUCTION: This paper explores the benefits of using n-grams and semantic features for the classification of disease outbreak reports, in the context of the BioCaster disease outbreak report text mining system. A novel feature of this work is the use of a general purpose semantic tagger - the USAS tagger - to generate features.

BACKGROUND: We outline the application context for this work (the BioCaster epidemiological text mining system), before going on to describe the experimental data used in our classification experiments (the 1000 document BioCaster corpus).

FEATURE SETS: Three broad groups of features are used in this work: Named Entity based features, n-gram features, and features derived from the USAS semantic tagger.

METHODOLOGY: Three standard machine learning algorithms - Naïve Bayes, the Support Vector Machine algorithm, and the C4.5 decision tree algorithm - were used for classifying experimental data (that is, the BioCaster corpus). Feature selection was performed using the chi-squared (χ²) feature selection algorithm. Standard text classification performance metrics - Accuracy, Precision, Recall, Specificity and F-score - are reported.

RESULTS: A feature representation composed of unigrams, bigrams, trigrams and features derived from a semantic tagger, in conjunction with the Naïve Bayes algorithm and feature selection yielded the highest classification accuracy (and F-score). This result was statistically significant compared to a baseline unigram representation and to previous work on the same task. However, it was feature selection rather than semantic tagging that contributed most to the improved performance.

CONCLUSION: This study has shown that for the classification of disease outbreak reports, a combination of bag-of-words, n-grams and semantic features, in conjunction with feature selection, increases classification accuracy at a statistically significant level compared to previous work in this domain.
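The n-gram extraction, chi-squared feature scoring and evaluation metrics named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`ngrams`, `chi_squared`, `metrics`) and the toy sentences are hypothetical, and the code assumes a binary task with presence/absence features, without the semantic tagger or any of the three classifiers.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Feature = Tuple[str, ...]


def ngrams(tokens: List[str], max_n: int = 3) -> Set[Feature]:
    """Return the set of unigrams, bigrams and trigrams found in a token list."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def chi_squared(docs: List[Set[Feature]], labels: List[int]) -> Dict[Feature, float]:
    """Score each feature with the chi-squared statistic of its 2x2 table
    (feature present/absent versus positive/negative class)."""
    n = len(docs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    present_all: Counter = Counter()
    present_pos: Counter = Counter()
    for feats, label in zip(docs, labels):
        for f in feats:
            present_all[f] += 1
            if label == 1:
                present_pos[f] += 1
    scores: Dict[Feature, float] = {}
    for f, total in present_all.items():
        a = present_pos[f]   # feature present, positive class
        b = total - a        # feature present, negative class
        c = n_pos - a        # feature absent, positive class
        d = n_neg - b        # feature absent, negative class
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[f] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute the five reported metrics from confusion-matrix counts."""
    def ratio(num: float, den: float) -> float:
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f_score": ratio(2 * precision * recall, precision + recall),
    }


# Toy, made-up data (not from the BioCaster corpus): 1 = outbreak report.
toy_docs = [
    ngrams("new cases of avian flu reported".split()),
    ngrams("flu vaccine research funding announced".split()),
]
toy_labels = [1, 0]
toy_scores = chi_squared(toy_docs, toy_labels)
```

In a feature selection step of this kind, features would be ranked by score and only the highest-scoring ones kept before training a classifier; the cut-off used in the paper is not stated in the abstract.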

Similar resources

N-gram-Based Low-Dimensional Representation for Document Classification

The bag-of-words (BOW) model is the common approach for classifying documents, where words are used as feature for training a classifier. This generally involves a huge number of features. Some techniques, such as Latent Semantic Analysis (LSA) or Latent Dirichlet Allocation (LDA), have been designed to summarize documents in a lower dimension with the least semantic information loss. Some sema...

Review Classification Using Semantic Features and Run-Time Weighting

We introduce a method for learning to assign suitable sentiment ratings to review articles. In our approach, reviews are transformed into collections of n-gram and semantic word class features aimed at maximizing the probability of classifying them into accurate ratings. The method involves automatically segmenting review articles into sentences and automatically estimating associations between...

Enhancing Sensitivity Classification with Semantic Features Using Word Embeddings

Government documents must be reviewed to identify any sensitive information they may contain, before they can be released to the public. However, traditional paper-based sensitivity review processes are not practical for reviewing born-digital documents. Therefore, there is a timely need for automatic sensitivity classification techniques, to assist the digital sensitivity review process. Howev...

Learning Semantic Tags from Big Data for Clinical Text Representation

In clinical text mining, it is one of the biggest challenges to represent medical terminologies and n-gram terms in sparse medical reports using either supervised or unsupervised methods. Addressing this issue, we propose a novel method for word and n-gram representation at semantic level. We first represent each word by its distance with a set of reference features calculated by reference dist...

Journal:
  • International journal of medical informatics

Volume 78, Issue 12

Pages: -

Publication date: 2009